## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## [1] 1319 13
## [1] 63 13
## [1] 217 13
The quality distribution looks roughly normal: about 82.5% of the red wines (1319 of 1599) are rated 5 or 6. Only 10 wines are quality 3 and 18 wines are quality 8. My initial thought is that the variables with a strong impact on wine quality should also be roughly normally distributed.
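As a quick check of the shares quoted here, the sketch below recomputes them from the three group sizes printed above (63, 1319, and 217 rows). The names `group_sizes` and `group_share` are illustrative and not taken from the original analysis.

```r
# Group sizes taken from the dim() output above:
# quality 3-4 (bad), quality 5-6 (average), quality 7-8 (good)
group_sizes <- c(bad = 63, average = 1319, good = 217)

# The three groups should add up to the full dataset
total <- sum(group_sizes)

# Share of each group in percent, rounded to one decimal place
group_share <- round(100 * group_sizes / total, 1)

print(total)
print(group_share)
```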
The fixed.acidity histogram peaks around 7 and is right-skewed, so I log-transform it.
The histogram of volatile.acidity appears to have two peaks, at about 0.4 and 0.7, and it is right-skewed. I apply a log transform here as well.
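The log transform can be sketched in base R as follows. This is only an illustration on the first ten volatile.acidity values shown in the `str()` output, not the full dataset, and it uses `hist()` with `log10()` as an assumption about how such a plot could be produced.

```r
# First ten volatile.acidity values from the str() output (illustration only)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Base-10 log transform of the variable
log_va <- log10(volatile_acidity)

# plot = FALSE returns the bin counts without drawing;
# drop it to draw the histograms
h_raw <- hist(volatile_acidity, plot = FALSE)
h_log <- hist(log_va, plot = FALSE)

print(h_raw$counts)
print(h_log$counts)

# The transform is reversible: 10^x recovers the original values
back_transformed <- 10^log_va
```

Note that `log10(0)` is `-Inf`, so any zero-valued observations would fall outside a log-scale histogram.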
There are some missing values at the low end of the x axis in the transformed figure.
The citric.acid histogram has a strikingly high bar at zero and another one at about 0.5.
Residual sugar looks like a normal distribution with a long right tail. Most wines have residual sugar below 4. I also log-transform it.
The chlorides histogram looks similar to the residual.sugar one, with a high peak at around 0.08.
The histograms of free.sulfur.dioxide and total.sulfur.dioxide look similar. Both are strongly right-skewed, with high counts at low sulfur dioxide levels.
Density and pH both look approximately normal. Most wines have a density near 0.997 and a pH near 3.3.
Sulphates has a roughly normal distribution with a right-skewed tail. The peak is around 0.7. I log-transform it.
There is a high peak at the low end of the alcohol range, around 9.5 (the minimum is 8.4), and the distribution is right-skewed. I log-transform it.
I am going to look exclusively at good wines, those with quality 7 and 8, to find out whether they share common characteristics.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 8.0 Min. : 4.900 Min. :0.1200 Min. :0.0000
## 1st Qu.: 482.0 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000
## Median : 939.0 Median : 8.700 Median :0.3700 Median :0.4000
## Mean : 831.7 Mean : 8.847 Mean :0.4055 Mean :0.3765
## 3rd Qu.:1089.0 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900
## Max. :1585.0 Max. :15.600 Max. :0.9150 Max. :0.7600
## residual.sugar chlorides free.sulfur.dioxide
## Min. :1.200 Min. :0.01200 Min. : 3.00
## 1st Qu.:2.000 1st Qu.:0.06200 1st Qu.: 6.00
## Median :2.300 Median :0.07300 Median :11.00
## Mean :2.709 Mean :0.07591 Mean :13.98
## 3rd Qu.:2.700 3rd Qu.:0.08500 3rd Qu.:18.00
## Max. :8.900 Max. :0.35800 Max. :54.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9906 Min. :2.880 Min. :0.3900
## 1st Qu.: 17.00 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6500
## Median : 27.00 Median :0.9957 Median :3.270 Median :0.7400
## Mean : 34.89 Mean :0.9960 Mean :3.289 Mean :0.7435
## 3rd Qu.: 43.00 3rd Qu.:0.9973 3rd Qu.:3.380 3rd Qu.:0.8200
## Max. :289.00 Max. :1.0032 Max. :3.780 Max. :1.3600
## alcohol quality
## Min. : 9.20 Min. :7.000
## 1st Qu.:10.80 1st Qu.:7.000
## Median :11.60 Median :7.000
## Mean :11.52 Mean :7.083
## 3rd Qu.:12.20 3rd Qu.:7.000
## Max. :14.00 Max. :8.000
I compared the summary of all the wines and the summary of the good wines. I calculated how much the mean value for each variable has changed. Based on the results, I divided the 11 variables into four groups:
1. Strong change (>20%): volatile.acidity, citric.acid, total.sulfur.dioxide.
2. Medium change (10% - 13%): chlorides, free.sulfur.dioxide, sulphates, alcohol.
3. Small change (~6%): fixed.acidity, residual.sugar.
4. Tiny change (<1%): density, pH.
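The percentage changes behind this grouping can be reproduced from the two `summary()` outputs. The sketch below copies the mean values by hand from the all-wine and good-wine summaries; the vector names are illustrative.

```r
vars <- c("fixed.acidity", "volatile.acidity", "citric.acid",
          "residual.sugar", "chlorides", "free.sulfur.dioxide",
          "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol")

# Means copied from summary() of all wines
mean_all <- c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
              46.47, 0.9967, 3.311, 0.6581, 10.42)

# Means copied from summary() of the good wines (quality 7 and 8)
mean_good <- c(8.847, 0.4055, 0.3765, 2.709, 0.07591, 13.98,
               34.89, 0.9960, 3.289, 0.7435, 11.52)

names(mean_all) <- vars
names(mean_good) <- vars

# Relative change of the mean, in percent, from all wines to good wines
pct_change <- round(100 * (mean_good - mean_all) / mean_all, 1)

# Largest absolute changes first
print(pct_change[order(abs(pct_change), decreasing = TRUE)])
```

Because the inputs are the rounded means printed by `summary()`, the results are only accurate to about one decimal place.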
Similarly, I create a group of wines with quality 3 and 4 to investigate how the variables change as the quality goes down.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 19.0 Min. : 4.600 Min. :0.2300 Min. :0.0000
## 1st Qu.: 435.0 1st Qu.: 6.800 1st Qu.:0.5650 1st Qu.:0.0200
## Median : 834.0 Median : 7.500 Median :0.6800 Median :0.0800
## Mean : 837.7 Mean : 7.871 Mean :0.7242 Mean :0.1737
## 3rd Qu.:1285.5 3rd Qu.: 8.400 3rd Qu.:0.8825 3rd Qu.:0.2700
## Max. :1522.0 Max. :12.500 Max. :1.5800 Max. :1.0000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 1.200 Min. :0.04500 Min. : 3.00
## 1st Qu.: 1.900 1st Qu.:0.06850 1st Qu.: 5.00
## Median : 2.100 Median :0.08000 Median : 9.00
## Mean : 2.685 Mean :0.09573 Mean :12.06
## 3rd Qu.: 2.950 3rd Qu.:0.09450 3rd Qu.:15.50
## Max. :12.900 Max. :0.61000 Max. :41.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9934 Min. :2.740 Min. :0.3300
## 1st Qu.: 13.50 1st Qu.:0.9957 1st Qu.:3.300 1st Qu.:0.4950
## Median : 26.00 Median :0.9966 Median :3.380 Median :0.5600
## Mean : 34.44 Mean :0.9967 Mean :3.384 Mean :0.5922
## 3rd Qu.: 48.00 3rd Qu.:0.9977 3rd Qu.:3.500 3rd Qu.:0.6000
## Max. :119.00 Max. :1.0010 Max. :3.900 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.60 1st Qu.:4.000
## Median :10.00 Median :4.000
## Mean :10.22 Mean :3.841
## 3rd Qu.:11.00 3rd Qu.:4.000
## Max. :13.10 Max. :4.000
If a variable has a strong impact on wine quality, I expect its mean value to move in opposite directions for good and bad wines relative to all wines. Based on this criterion, I regroup the 11 variables by suspected impact:
1. Strong impact: volatile.acidity, citric.acid.
2. Medium impact: chlorides, sulphates.
3. Small impact: fixed.acidity, free.sulfur.dioxide, alcohol.
4. Tiny impact: residual.sugar, total.sulfur.dioxide, density, pH.
Surprisingly, the mean value of total.sulfur.dioxide drops by about 25% for both good and bad wines compared with all wines. So we cannot rely on this variable to separate good wines from bad ones.
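The "opposite direction" criterion can be sketched for a few of the variables as follows. The means are copied by hand from the three `summary()` outputs; the choice of five variables and all object names are illustrative.

```r
vars <- c("volatile.acidity", "citric.acid", "chlorides",
          "sulphates", "total.sulfur.dioxide")

# Means copied from the all-wine, good-wine, and bad-wine summaries
mean_all  <- setNames(c(0.5278, 0.271, 0.08747, 0.6581, 46.47), vars)
mean_good <- setNames(c(0.4055, 0.3765, 0.07591, 0.7435, 34.89), vars)
mean_bad  <- setNames(c(0.7242, 0.1737, 0.09573, 0.5922, 34.44), vars)

# Percent change of the mean relative to all wines
change_good <- 100 * (mean_good - mean_all) / mean_all
change_bad  <- 100 * (mean_bad - mean_all) / mean_all

# TRUE when good and bad wines move in opposite directions
opposite <- sign(change_good) != sign(change_bad)

print(data.frame(change_good = round(change_good, 1),
                 change_bad = round(change_bad, 1),
                 opposite = opposite))
```

In this sketch total.sulfur.dioxide is the only one of the five that moves in the same direction for both groups.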
Although the above grouping is based solely on the change in mean values, it suggests that volatile.acidity and citric.acid are likely to be correlated with wine quality.
Let’s compare the histograms of these two variables in the all-wine, good-wine, and bad-wine groups.
Most good wines have volatile acidity lower than 0.8, while the bad wines tend to have a wider and more scattered spread of volatile acidity values.
As for citric acid, a lot of good wines have values between 0.3 and 0.7, but only a few bad wines fall in this range.
I’m interested in how much the fixed acidity accounts for the total acidity. I assume the total acidity can be calculated as the sum of fixed acidity and volatile acidity, so I create a new variable named “fixed.acidity.percent”, which is calculated by: fixed.acidity / (fixed.acidity + volatile.acidity)
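A minimal sketch of this calculation, using the first four rows shown in the `str()` output. The data frame name `wine_head` is illustrative; the full analysis would apply the same expression to the whole dataset.

```r
# First four rows of the two acidity columns, from the str() output
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28)
)

# Share of fixed acidity in the assumed total (fixed + volatile) acidity.
# Despite the name, the result is a fraction between 0 and 1.
wine_head$fixed.acidity.percent <- with(
  wine_head,
  fixed.acidity / (fixed.acidity + volatile.acidity)
)

print(round(wine_head, 4))
```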
I also created a pH.bucket variable to divide pH into five groups.
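One way to build such a variable is with `cut()`. The break points below are an assumption chosen only to cover the observed pH range (2.74 to 4.01) with five intervals; the original break points are not stated. The values are the first ten pH readings from the `str()` output.

```r
# First ten pH values from the str() output (illustration only)
pH <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Hypothetical break points giving five right-closed intervals
ph_breaks <- c(2.7, 3.0, 3.2, 3.4, 3.6, 4.1)

# cut() turns the numeric variable into a factor with one level per interval
pH_bucket <- cut(pH, breaks = ph_breaks)

print(table(pH_bucket))
```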
There are 1599 wines in the dataset with 11 attributes that may affect wine quality. All the variables are numeric. There are no NA values in this dataset.
1319 out of 1599 red wines are rated as 5 and 6.
The histograms of density and pH are close to a normal distribution.
There is a high peak where citric.acid equals zero.
The histograms of free.sulfur.dioxide and total.sulfur.dioxide have similar shapes, suggesting that these two variables may be strongly correlated.
I suspect that volatile.acidity and citric.acid are the two major features that determine the quality of wine. The mean value of volatile acidity is 0.4055 for good wines and 0.7242 for bad wines. The median value of citric acid is 0.4 for good wines, while for bad wines it’s only 0.08. Some other variables might have a minor impact on the wine quality.
Chlorides, sulphates, fixed.acidity, free.sulfur.dioxide, and alcohol might have a medium or small impact on the quality of wine.
I created a new variable named “fixed.acidity.percent” because I’m interested in how much the fixed acidity accounts for the total acidity, which may influence the wine quality.
I created a quality.bucket variable to divide the wines into three groups based on their quality. I also created a pH.bucket variable to divide pH into five groups.
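A sketch of the three-way quality grouping, assuming the groups are quality 3 and 4, 5 and 6, and 7 and 8 as used elsewhere in this report. The labels are illustrative, and the input is the first ten quality ratings from the `str()` output.

```r
# First ten quality ratings from the str() output (illustration only)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Right-closed intervals (2,4], (4,6], (6,8] correspond to
# quality 3-4, 5-6, and 7-8; the labels are hypothetical
quality_bucket <- cut(quality,
                      breaks = c(2, 4, 6, 8),
                      labels = c("bad", "average", "good"))

print(table(quality_bucket))
```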
I noticed that the histogram of volatile.acidity seems to have two distinct peaks, so I log-transformed it to make the two peaks clearer. It looks like there’s one peak around 0.4 and another around 0.7. These two peaks correspond well with the mean values of volatile.acidity for the good and bad wine groups: 0.4055 for good wines and 0.7242 for bad wines.
From the scatterplot matrix above, it turns out that the correlation coefficients of quality with volatile.acidity, citric.acid, sulphates, and alcohol are higher than those of the other variables.
What also interests me are the following pairs of variables with strong correlations (absolute value > 0.5):
1. free.sulfur.dioxide vs. total.sulfur.dioxide
2. fixed.acidity vs. density and pH
Next, I want to look at the boxplots involving quality and other variables.
##
## Pearson's product-moment correlation
##
## data: pf$volatile.acidity and pf$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
Good wines have a mean volatile acidity of about 0.4 (0.4055 for the quality 7 and 8 group). The correlation between volatile acidity and quality is -0.391.
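The t value in the test output follows directly from the correlation and the degrees of freedom, t = r * sqrt(df) / sqrt(1 - r^2). The sketch below recomputes it from the printed estimate; the helper name `t_from_r` is illustrative.

```r
# t statistic implied by a Pearson correlation r with df = n - 2
t_from_r <- function(r, df) {
  r * sqrt(df) / sqrt(1 - r^2)
}

# Estimate and degrees of freedom copied from the output above
t_va <- t_from_r(-0.3905578, 1597)

# Two-sided p-value from the t distribution
p_va <- 2 * pt(-abs(t_va), df = 1597)

print(round(t_va, 3))
print(p_va)
```

The same helper applies to the other correlation tests in this section, since they all have df = 1597.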
##
## Pearson's product-moment correlation
##
## data: pf$citric.acid and pf$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The mean citric.acid value for bad wines is lower than 0.2, while for good wines it’s higher than 0.3. The correlation between citric acid and quality is 0.226.
##
## Pearson's product-moment correlation
##
## data: pf$sulphates and pf$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Good wines have higher mean sulphates values than bad wines, although the difference is not that big. The correlation between these two variables is 0.251.
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
Although quality 5 wines have a lower mean alcohol value than quality 4 wines, the good wines have a much higher mean alcohol value than the bad wines. The correlation between alcohol and quality is 0.476.
##
## Pearson's product-moment correlation
##
## data: pf$fixed.acidity.percent and pf$quality
## t = 14.784, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3030968 0.3893614
## sample estimates:
## cor
## 0.3469627
Quality 7 wines have the highest mean fixed.acidity.percent. The correlation between fixed.acidity.percent and quality is 0.347.
Based on the above boxplots, volatile.acidity and citric.acid play important roles in determining the wine quality. I will also focus on “sulphates” and “alcohol” among the medium and small impact factors that I mentioned in the Univariate Analysis.
Next, I want to see the scatter plot of free.sulfur.dioxide vs total.sulfur.dioxide.
Although the relationship does not look linear, all the points seem to be confined to a cone-shaped region.
I also take a look at the relationships between density and pH, alcohol and volatile acidity, and sulphates and alcohol. Most of the wines seem to have pH ~ 3.3 and density ~ 0.996.
The majority of the red wines have an alcohol level lower than 11. According to the summary, the middle half of the volatile acidity values lies between 0.39 and 0.64.
Sulphates is mostly between 0.4 and 0.8.
According to the boxplots, volatile.acidity generally decreases as the quality goes up, and citric.acid increases as the quality goes up.
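The per-quality means that the boxplots summarize can be computed with `tapply()`. The sketch below uses only the first ten rows from the `str()` output, so its numbers illustrate the mechanics and are not representative of the full dataset.

```r
# First ten rows from the str() output (illustration only)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Mean volatile acidity within each quality level
mean_by_quality <- tapply(volatile_acidity, quality, mean)

print(round(mean_by_quality, 3))
```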
The correlation between volatile acidity and quality is -0.391. The correlation between citric acid and quality is 0.226.
Additionally, good wines usually have higher sulphates and alcohol levels. The correlation between sulphates and quality is 0.251. The correlation between alcohol and quality is 0.476.
From the scatterplot of alcohol vs sulphates, I noticed that the majority of the red wines have an alcohol level lower than 11 (% by volume), and sulphates between 0.4 and 0.8 g/dm^3.
The variation of volatile acidity looks bigger than that of sulphates.
Fixed.acidity.percent tends to increase as the quality goes up, with the highest mean at quality 7.
Most of the wines seem to have pH ~ 3.3 and density ~ 0.996, according to the scatter plot of density and pH.
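The two features I suspected of influencing quality most, volatile.acidity and citric.acid, are both confirmed to be related to quality. The correlation between volatile acidity and quality is -0.391, and the correlation between citric acid and quality is 0.226. Note that alcohol (0.476) has a larger correlation with quality than either of them.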
Quality 7 wines have volatile acidity around 0.4 and citric acid around 0.35. Quality 5 and 6 wines cover quite a large range of citric acid values.
I compare the volatile acidity and citric acid of good and bad wines only. I notice that most good wines have volatile acidity from 0.25 to 0.5, and citric acid from 0.2 to 0.6. The points for bad wines are more scattered, but they tend to have volatile acidity above 0.5 and citric acid below 0.3.
For good wines, the sulphates value is mostly from 0.6 to 0.9 and the alcohol value from 10 to 13. For bad wines, on the other hand, the sulphates value is mostly from 0.25 to 0.65, and the alcohol value from 9 to 11.5.
##
## Pearson's product-moment correlation
##
## data: pf$pH and pf$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
We can see the different pH buckets gathered at quality 5, 6, 7, and 8. There’s no clear relationship between pH and quality, which indicates that pH is not a good variable for separating good and bad wines.
These plots support the idea that higher citric.acid and lower volatile.acidity are associated with better wines. Also, better wines tend to have higher sulphates and alcohol content.
From the fourth plot, it turns out that pH has very little impact on wine quality, although the distribution of pH is also close to normal. The correlation between pH and quality is only -0.0577. It is statistically distinguishable from zero (p = 0.021), but far too weak to be useful for separating wines by quality.
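The 95 percent confidence interval reported for the pH correlation can be reproduced with the Fisher z-transformation, which is the standard approach for Pearson correlations. The helper name `r_to_ci` is illustrative; the inputs are the estimate and sample size from the output above.

```r
# Confidence interval for a Pearson correlation via Fisher's z-transformation
r_to_ci <- function(r, n, level = 0.95) {
  z <- atanh(r)                        # transform r to the z scale
  se <- 1 / sqrt(n - 3)                # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)   # normal critical value
  tanh(z + c(-1, 1) * crit * se)       # back-transform the interval
}

# pH vs quality: r = -0.05773139 with n = 1599 wines
ci_ph <- r_to_ci(-0.05773139, 1599)

print(ci_ph)
```

The whole interval sits close to zero, which is consistent with reading this correlation as negligible in practice.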
I first choose to plot the histogram of wine quality from the data set, because this is the main variable that I’m interested in investigating. I want to know what features may change the wine quality. The quality distribution looks roughly normal, with about 82.5% of the red wines rated as 5 and 6. 63 wines are of quality 3 & 4, and 217 wines are of quality 7 & 8. So we can define the bad wines group as quality lower than 5, and the good wines group as quality higher than 6.
Citric acid is one of the major variables that I suspect to have a strong impact on wine quality. So I use a box plot to see the mean citric acid value and the quantiles for each wine quality. I notice that better wines tend to have higher values of citric acid. The mean value of citric acid increases as the wine quality gets better. It’s between 0 and 0.2 for bad wines, 0.2 ~ 0.4 for quality 5 & 6, and equal to or over 0.4 for good wines. What’s more, good wines have smaller citric acid variation. This result supports the view that citric acid is a major contributor to wine quality.
Sulphates and alcohol values are the other two features that I suspect to influence the wine quality a lot. So I plot alcohol versus sulphates and use different colors to represent different wine quality. I look exclusively at good wines (quality 7 & 8, blue) and bad wines (quality 3 & 4, brown) in order to make the trend clearer. Most of the blue and dark blue points are gathered in the upper right of this plot. These points have sulphates values from 0.7 to 1.25, and alcohol values from 10 to 14. This indicates that better wines usually have higher alcohol and sulphates levels.
Through this exploratory data analysis, I identified the key features associated with red wine quality. I learned that we must look not only at univariate plots, but also at plots of two or more variables to carefully inspect different possibilities. For example, the normally distributed pH gave me the feeling that it might affect the wine quality a lot, since the histogram of quality is also roughly normal. However, after looking at the boxplot of pH vs quality, it turned out that pH does not have a strong correlation with wine quality. Therefore, we need to verify our ideas through in-depth analysis.
I improved my EDA skills a lot through this study. I learned that better analysis can be generated by removing the extreme outliers in the data. I learned that giving clear statistics along with proper plots can enhance the analysis. I also learned that detailed description is an important part of EDA. I spent a lot of time adding more comments and extending my discussion on the plots so that my ideas can be better conveyed through the reports.
The analysis suggests that four factors are mainly involved in determining quality: citric.acid, volatile.acidity, alcohol, and sulphates. It is important to note, however, that wine quality is subjective and may vary, as different wine experts may have different tastes. It would be better to know the background of these wine experts, as experts from France and India may have different standards for evaluating wine quality. Also, as we see from the histogram of wine quality, it is definitely not a perfect normal distribution. It would be a great help if the experts could give ratings on a more precise scale, for example, 3, 3.5, 4, … , 7, 7.5, 8. That way, this data set might generate more convincing results.